{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Supervised sentiment: hand-built feature functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "__author__ = \"Christopher Potts\"\n",
    "__version__ = \"CS224u, Stanford, Fall 2020\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. [Overview](#Overview)\n",
    "1. [Set-up](#Set-up)\n",
    "1. [Feature functions](#Feature-functions)\n",
    "  1. [Unigrams](#Unigrams)\n",
    "  1. [Bigrams](#Bigrams)\n",
    "  1. [A note on DictVectorizer](#A-note-on-DictVectorizer)\n",
    "1. [Building datasets for experiments](#Building-datasets-for-experiments)\n",
    "1. [Basic optimization](#Basic-optimization)\n",
    "  1. [Wrapper for SGDClassifier](#Wrapper-for-SGDClassifier)\n",
    "  1. [Wrapper for LogisticRegression](#Wrapper-for-LogisticRegression)\n",
    "  1. [Wrapper for TorchShallowNeuralClassifier](#Wrapper-for-TorchShallowNeuralClassifier)\n",
    "  1. [A softmax classifier in PyTorch](#A-softmax-classifier-in-PyTorch)\n",
    "  1. [Using sklearn Pipelines](#Using-sklearn-Pipelines)\n",
    "1. [Hyperparameter search](#Hyperparameter-search)\n",
    "  1. [utils.fit_classifier_with_hyperparameter_search](#utils.fit_classifier_with_hyperparameter_search)\n",
    "  1. [Example using LogisticRegression](#Example-using-LogisticRegression)\n",
    "1. [Reproducing  baselines from Socher et al. 2013](#Reproducing--baselines-from-Socher-et-al.-2013)\n",
    "  1. [Reproducing the Unigram NaiveBayes results](#Reproducing-the-Unigram-NaiveBayes-results)\n",
    "  1. [Reproducing the Bigrams NaiveBayes results](#Reproducing-the-Bigrams-NaiveBayes-results)\n",
    "  1. [Reproducing the SVM results](#Reproducing-the-SVM-results)\n",
    "1. [Statistical comparison of classifier models](#Statistical-comparison-of-classifier-models)\n",
    "  1. [Comparison with the Wilcoxon signed-rank test](#Comparison-with-the-Wilcoxon-signed-rank-test)\n",
    "  1. [Comparison with McNemar's test](#Comparison-with-McNemar's-test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Overview\n",
    "\n",
    "* The focus of this notebook is __building feature representations__ for use with (mostly linear) classifiers (though you're encouraged to try out some non-linear ones as well!).\n",
    "\n",
    "* The core characteristics of the feature functions we'll build here:\n",
    "   * They represent examples in __very large, very sparse feature spaces__.\n",
    "   * The individual feature functions can be __highly refined__, drawing on expert human knowledge of the domain. \n",
    "   * Taken together, these representations don't comprehensively represent the input examples. They just identify aspects of the inputs that the classifier model can make good use of (we hope).\n",
    "   \n",
    "* These classifiers tend to be __highly competitive__. We'll look at more powerful deep learning models in the next notebook, and it will immediately become apparent that it is very difficult to get them to measure up to well-built classifiers based in sparse feature representations. It can be done, but it tends to require a lot of attention to optimization details (and potentially a lot of compute resources).\n",
    "\n",
    "* For this notebook, we look in detail at just two very general strategies for featurization: unigram-based and bigram-based. This gives us a chance to introduce core concepts in optimization. The [associated homework](hw_sst.ipynb) is oriented towards designing more specialized, linguistically intricate feature functions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Set-up\n",
    "\n",
    "See [the previous notebook](sst_01_overview.ipynb#Set-up) for set-up instructions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from np_sgd_classifier import BasicSGDClassifier\n",
    "import os\n",
    "import pandas as pd\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.svm import LinearSVC\n",
    "import scipy.stats\n",
    "from nltk.tree import Tree\n",
    "import torch.nn as nn\n",
    "from torch_shallow_neural_classifier import TorchShallowNeuralClassifier\n",
    "import sst\n",
    "import utils"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "utils.fix_random_seeds()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "SST_HOME = os.path.join('data', 'trees')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Feature functions\n",
    "\n",
    "* Feature representation is arguably __the most important step in any machine learning task__. As you experiment with the SST, you'll come to appreciate this fact, since your choice of feature function will have a far greater impact on the effectiveness of your models than any other choice you make. This is especially true if you are careful to optimize the hyperparameters of your models.\n",
    "\n",
    "* We will define our feature functions as `dict`s mapping feature names (which can be any object that can be a `dict` key) to their values (which must be `bool`, `int`, or `float`). \n",
    "\n",
    "* To prepare for optimization, we will use `sklearn`'s [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) class to turn these into matrices of features. \n",
    "\n",
    "* The `dict`-based approach gives us a lot of flexibility and frees us from having to worry about the underlying feature matrix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unigrams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A typical baseline or default feature representation in NLP or NLU is built from unigrams. Here, those are the leaf nodes of the tree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def unigrams_phi(tree):\n",
    "    \"\"\"\n",
    "    The basis for a unigrams feature function. Downcases all tokens.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    tree : nltk.tree\n",
    "        The tree to represent.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    defaultdict\n",
    "        A map from strings to their counts in `tree`. (Counter maps a\n",
    "        list to a dict of counts of the elements in that list.)\n",
    "\n",
    "    \"\"\"\n",
    "    return Counter([w.lower() for w in tree.leaves()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "example_tree = Tree.fromstring(\"\"\"(4 (2 NLU) (4 (2 is) (4 enlightening)))\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'nlu': 1, 'is': 1, 'enlightening': 1})"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unigrams_phi(example_tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bigrams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bigrams_phi(tree):\n",
    "    \"\"\"\n",
    "    The basis for a bigrams feature function. Downcases all tokens.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    tree : nltk.tree\n",
    "        The tree to represent.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    defaultdict\n",
    "        A map from tuples to their counts in `tree`.\n",
    "\n",
    "    \"\"\"\n",
    "    leaves = [w.lower() for w in tree.leaves()]\n",
    "    left = [utils.START_SYMBOL] + leaves\n",
    "    right = leaves + [utils.END_SYMBOL]\n",
    "    grams = list(zip(left, right))\n",
    "    return Counter(grams)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({('<s>', 'nlu'): 1,\n",
       "         ('nlu', 'is'): 1,\n",
       "         ('is', 'enlightening'): 1,\n",
       "         ('enlightening', '</s>'): 1})"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bigrams_phi(example_tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's generally good design to __write lots of atomic feature functions__ and then bring them together into a single function when running experiments. This will lead to reusable parts that you can assess independently and in sub-groups as part of development."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on DictVectorizer\n",
    "\n",
    "I've tried to be careful above to say that the above functions are just the __basis__ for feature representations. In truth, our models typically do represent examples as dictionaries, but rather as vectors embedded in a matrix. In general, to manage the translation from dictionaries to vectors, we use  `sklearn.feature_extraction.DictVectorizer` instances. Here's a brief overview of how these work:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To start, suppose that we had just two examples to represent, and our feature function mapped them to the following list of dictionaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_feats = [\n",
    "    {'a': 1, 'b': 1},\n",
    "    {'b': 1, 'c': 2}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we create a `DictVectorizer`. So that we can more easily inspect the resulting matrix, I've set `sparse=False`, so that the return value is a dense matrix. For real problems, you'll probably want to use `sparse=True`, as it will be vastly more efficient for the very sparse feature matrices that you are likely to be creating."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "vec = DictVectorizer(sparse=False)  # Use `sparse=True` for real problems!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `fit_transform` method maps our list of dictionaries to a matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = vec.fit_transform(train_feats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here I'll create a `pd.Datafame` just to help us inspect `X_train`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a    b    c\n",
       "0  1.0  1.0  0.0\n",
       "1  0.0  1.0  2.0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(X_train, columns=vec.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can see that, intuitively, the feature called \"a\" is embedded in the first column, \"b\" in the second column, and \"c\" in the third."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now suppose we have some new test examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_feats = [\n",
    "    {'a': 2, 'c': 1},\n",
    "    {'a': 4, 'b': 2, 'd': 1}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we have trained a model on `X_train`, then it will not have any way to deal with this new feature \"d\". This shows that we need to embed `test_feats` in the same space as `X_train`. To do this, one just calls `transform` on the existing vectorizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = vec.transform(test_feats)  # Not `fit_transform`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a    b    c\n",
       "0  2.0  0.0  1.0\n",
       "1  4.0  2.0  0.0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(X_test, columns=vec.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most common mistake with `DictVectorizer` is calling `fit_transform` on test examples. This will wipe out the existing representation scheme, replacing it with one that matches the test examples. That will happen silently, but then you'll find that the new representations are incompatible with the model you fit. This is likely to manifest itself as a `ValueError` relating to feature counts. Here's an example that might help you spot this if and when it arises in your own work: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ValueError: X has 4 features per sample; expecting 3\n"
     ]
    }
   ],
   "source": [
    "toy_mod = LogisticRegression()\n",
    "\n",
    "vec = DictVectorizer(sparse=False)\n",
    "\n",
    "X_train = vec.fit_transform(train_feats)\n",
    "\n",
    "toy_mod.fit(X_train, [0, 1])\n",
    "\n",
    "# Here's the error! Don't use `fit_transform` again! Use `transform`!\n",
    "X_test = vec.fit_transform(test_feats)\n",
    "\n",
    "try:\n",
    "    toy_mod.predict(X_test)\n",
    "except ValueError as err:\n",
    "    print(\"ValueError: {}\".format(err))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is actually the lucky case. If your train and test sets have the same number of features (columns), then no error will arise, and you might not even notice the misalignment between the train and test feature matrices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In what follows, all these steps will be taken care of \"under the hood\", but it's good to be aware of what is happening. I think this also helps show the value of writing general experiment code, so that you don't have to check each experiment individually to make sure that you called the `DictVectorizer` methods (among other things!) correctly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Building datasets for experiments\n",
    "\n",
    "The second major phase for our analysis is a kind of set-up phase. Ingredients:\n",
    "\n",
    "* A reader like `train_reader`\n",
    "* A feature function like `unigrams_phi`\n",
    "* A class function like `binary_class_func`\n",
    "\n",
    "The convenience function `sst.build_dataset` uses these to build a dataset for training and assessing a model. See its documentation for details on how it works. Much of this is about taking advantage of `sklearn`'s many functions for model building."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset = sst.build_dataset(\n",
    "    SST_HOME,\n",
    "    reader=sst.train_reader,\n",
    "    phi=unigrams_phi,\n",
    "    class_func=sst.binary_class_func,\n",
    "    vectorizer=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train dataset with unigram features has 6,920 examples and 14,828 features.\n"
     ]
    }
   ],
   "source": [
    "print(\"Train dataset with unigram features has {:,} examples and \"\n",
    "      \"{:,} features.\".format(*train_dataset['X'].shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that `sst.build_dataset` has an optional argument `vectorizer`:\n",
    "\n",
    "* If it is `None`, then a new vectorizer is used and returned as `dataset['vectorizer']`. This is the usual scenario when training. \n",
    "\n",
    "* For evaluation, one wants to represent examples exactly as they were represented during training. To ensure that this happens, pass the training `vectorizer` to this function, so that `transform` is used, [as discussed just above](#A-note-on-DictVectorizer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "dev_dataset = sst.build_dataset(\n",
    "    SST_HOME,\n",
    "    reader=sst.dev_reader,\n",
    "    phi=unigrams_phi,\n",
    "    class_func=sst.binary_class_func,\n",
    "    vectorizer=train_dataset['vectorizer'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dev dataset with unigram features has 872 examples and 14,828 features\n"
     ]
    }
   ],
   "source": [
    "print(\"Dev dataset with unigram features has {:,} examples \"\n",
    "      \"and {:,} features\".format(*dev_dataset['X'].shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Basic optimization\n",
    "\n",
    "We're now in a position to begin training supervised models!\n",
    "\n",
    "For the most part, in this course, we will not study the theoretical aspects of machine learning optimization, concentrating instead on how to optimize systems effectively in practice. That is, this isn't a theory course, but rather an experimental, project-oriented one.\n",
    "\n",
    "Nonetheless, we do want to avoid treating our optimizers as black boxes that work their magic and give us some assessment figures for whatever we feed into them. That seems irresponsible from a scientific and engineering perspective, and it also sends the false signal that the optimization process is inherently mysterious. So we do want to take a minute to demystify it with some simple code.\n",
    "\n",
    "The module `np_sgd_classifier` contains a complete optimization framework, as `BasicSGDClassifier`. Well, it's complete in the sense that it achieves our full task of supervised learning. It's incomplete in the sense that it is very basic. You probably wouldn't want to use it in experiments. Rather, we're going to encourage you to rely on `sklearn` for your experiments (see below). Still, this is a good basic picture of what's happening under the hood.\n",
    "\n",
    "So what is `BasicSGDClassifier` doing? The heart of it is the `fit` function (reflecting the usual `sklearn` naming system). This method implements a hinge-loss stochastic sub-gradient descent optimization. Intuitively, it works as follows:\n",
    "\n",
    "1. Start by assuming that all the feature weights are `0`.\n",
    "1. Move through the dataset instance-by-instance in random order.\n",
    "1. For each instance, classify it using the current weights. \n",
    "1. If the classification is incorrect, move the weights in the direction of the correct classification\n",
    "\n",
    "This process repeats for a user-specified number of iterations (default `10` below), and the weight movement is tempered by a learning-rate parameter `eta` (default `0.1`). The output is a set of weights that can be used to make predictions about new (properly featurized) examples.\n",
    "\n",
    "In more technical terms, the objective function is \n",
    "\n",
    "$$\n",
    "  \\min_{\\mathbf{w} \\in \\mathbb{R}^{d}}\n",
    "  \\sum_{(x,y)\\in\\mathcal{D}} \n",
    "  \\max_{y'\\in\\mathbf{Y}}\n",
    "  \\left[\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')\\right] - \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y)\n",
    "$$\n",
    "\n",
    "where $\\mathbf{w}$ is the set of weights to be learned, $\\mathcal{D}$ is the training set of example&ndash;label pairs, $\\mathbf{Y}$ is the set of labels, $\\mathbf{cost}(y,y') = 0$ if $y=y'$, else $1$, and $\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y')$ is the inner product of the weights \n",
    "$\\mathbf{w}$ and the example as featurized according to $\\phi$.\n",
    "\n",
    "The `fit` method is then calculating the sub-gradient of this objective. In succinct pseudo-code:\n",
    "\n",
    "* Initialize $\\mathbf{w} = \\mathbf{0}$\n",
    "* Repeat $T$ times:\n",
    "    * for each $(x,y) \\in \\mathcal{D}$ (in random order):\n",
    "        * $\\tilde{y} = \\text{argmax}_{y'\\in \\mathcal{Y}} \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')$\n",
    "        * $\\mathbf{w} =  \\mathbf{w} + \\eta(\\phi(x,y) - \\phi(x,\\tilde{y}))$\n",
    "        \n",
    "This is very intuitive – push the weights in the direction of the positive cases. It doesn't require any probability theory. And such loss functions have proven highly effective in many settings. For a more powerful version of this classifier, see [sklearn.linear_model.SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier). With `loss='hinge'`, it should behave much like `BasicSGDClassifier` (but faster!).\n",
    "\n",
    "For the most part, the classifiers that we use in this course have a softmax objective function. The module [np_shallow_neural_classifier.py](np_shallow_neural_classifier.py) is a straightforward example. The precise calculations are a bit less transparent than those for `BasicSGDClassifier`, but the general logic is the same for both."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Wrapper for SGDClassifier\n",
    "\n",
    "For the sake of our experimental framework, a simple wrapper for `SGDClassifier`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_basic_sgd_classifier(X, y):\n",
    "    \"\"\"\n",
    "    Wrapper for `BasicSGDClassifier`.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    X : np.array, shape `(n_examples, n_features)`\n",
    "        The matrix of features, one example per row.\n",
    "\n",
    "    y : list\n",
    "        The list of labels for rows in `X`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    BasicSGDClassifier\n",
    "        A trained `BasicSGDClassifier` instance.\n",
    "\n",
    "    \"\"\"\n",
    "    mod = BasicSGDClassifier()\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This might look like a roundabout way of just calling `fit`. We'll see shortly that having a \"wrapper\" like this creates space for us to include a lot of other modeling steps.\n",
    "\n",
    "We now have all the pieces needed to run experiments. And __we're going to want to run a lot of experiments__, trying out different feature functions, taking different perspectives on the data and labels, and using different models. \n",
    "\n",
    "To make that process efficient and regimented, `sst` contains a function `experiment`. All it does is pull together these pieces and use them for training and assessment. It's complicated, but the flexibility will turn out to be an asset. Here's an example with all of the default values spelled out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.725     0.780     0.752       941\n",
      "    positive      0.805     0.755     0.779      1135\n",
      "\n",
      "    accuracy                          0.766      2076\n",
      "   macro avg      0.765     0.768     0.766      2076\n",
      "weighted avg      0.769     0.766     0.767      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_basic_sgd_classifier,\n",
    "    train_reader=sst.train_reader,\n",
    "    assess_reader=None,\n",
    "    train_size=0.7,\n",
    "    class_func=sst.binary_class_func,\n",
    "    score_func=utils.safe_macro_f1,\n",
    "    verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few notes on this function call:\n",
    "    \n",
    "* Since `assess_reader=None`, the function reports performance on a random train–test split from `train_reader`. Give `sst.dev_reader` as the argument to assess against the `dev` set.\n",
    "\n",
    "* `unigrams_phi` is the function we defined above. By changing/expanding this function, you can start to improve on the above baseline, perhaps periodically seeing how you do on the dev set.\n",
    "\n",
    "* `fit_basic_sgd_classifier` is the wrapper we defined above. To assess new models, simply define more functions like this one. Such functions just need to consume an `(X, y)` pair constituting a dataset and return a model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Wrapper for LogisticRegression\n",
    "\n",
    "As I said above, we likely don't want to rely on `BasicSGDClassifier` (though it does a good job with SST!). Instead, we want to rely on `sklearn` and our `torch_*` models. \n",
    "\n",
    "Here's a simple wrapper for [sklearn.linear.model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_softmax_classifier(X, y):\n",
    "    \"\"\"\n",
    "    Wrapper for `sklearn.linear.model.LogisticRegression`. This is\n",
    "    also called a Maximum Entropy (MaxEnt) Classifier, which is more\n",
    "    fitting for the multiclass case.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    X : np.array, shape `(n_examples, n_features)`\n",
    "        The matrix of features, one example per row.\n",
    "\n",
    "    y : list\n",
    "        The list of labels for rows in `X`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    sklearn.linear.model.LogisticRegression\n",
    "        A trained `LogisticRegression` instance.\n",
    "\n",
    "    \"\"\"\n",
    "    mod = LogisticRegression(\n",
    "        fit_intercept=True,\n",
    "        solver='liblinear',\n",
    "        multi_class='auto')\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And an experiment using `fit_softmax_classifier` and `unigrams_phi`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.756     0.745     0.751       973\n",
      "    positive      0.778     0.788     0.783      1103\n",
      "\n",
      "    accuracy                          0.768      2076\n",
      "   macro avg      0.767     0.766     0.767      2076\n",
      "weighted avg      0.768     0.768     0.768      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_softmax_classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wrapper for TorchShallowNeuralClassifier\n",
    "\n",
    "While we're at it, we might as well start to get a sense for whether adding a hidden layer to our softmax classifier yields any benefits. Whereas `LogisticRegression` is, at its core, computing\n",
    "\n",
    "$$\\begin{align*}\n",
    "y &= \\textbf{softmax}(xW_{xy} + b_{y})\n",
    "\\end{align*}$$\n",
    "\n",
    "the shallow neural network inserts a hidden layer with a non-linear activation applied to it:\n",
    "\n",
    "$$\\begin{align*}\n",
    "h &= \\tanh(xW_{xh} + b_{h}) \\\\\n",
    "y &= \\textbf{softmax}(hW_{hy} + b_{y})\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an illustrative example using `TorchShallowNeuralClassifier`, which is in the course repo's `torch_shallow_neural_classifier.py`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_nn_classifier(X, y):\n",
    "    mod = TorchShallowNeuralClassifier(\n",
    "        hidden_dim=100,\n",
    "        early_stopping=True,      # A basic early stopping set-up.\n",
    "        validation_fraction=0.1,  # If no improvement on the\n",
    "        tol=1e-5,                 # validation set is seen within\n",
    "        n_iter_no_change=10)      # `n_iter_no_change`, we stop.\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A noteworthy feature of this `fit_nn_classifier` is that it sets `early_stopping=True`. This instructs the optimizer to hold out a small fraction (see `validation_fraction`) of the training data to use as a dev set at the end of each epoch. Optimization will stop if improvements of at least `tol` on this dev set aren't seen within `n_iter_no_change` epochs. If that condition is triggered, the parameters from the top-scoring model are used for the final model. (For additional discussion, see [the section on model convergence in the evaluation methods notebook](#Assessing-models-without-convergence).)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another quick experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Stopping after epoch 26. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.20748527348041534"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.791     0.720     0.754      1011\n",
      "    positive      0.755     0.820     0.786      1065\n",
      "\n",
      "    accuracy                          0.771      2076\n",
      "   macro avg      0.773     0.770     0.770      2076\n",
      "weighted avg      0.773     0.771     0.770      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_nn_classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### A softmax classifier in PyTorch\n",
    "\n",
    "Our PyTorch modules should support easy modification as discussed in [tutorial_pytorch_models.ipynb](tutorial_pytorch_models.ipynb). Perhaps the simplest modification from that notebook uses `TorchShallowNeuralClassifier` to define a `TorchSoftmaxClassifier`. All you need to do for this is write a new `build_graph` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):\n",
    "\n",
    "    def build_graph(self):\n",
    "        return nn.Linear(self.input_dim, self.n_classes_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this function call, I added an L2 regularization term to help prevent overfitting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_torch_softmax(X, y):\n",
    "    mod = TorchSoftmaxClassifier(l2_strength=0.0001)\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Stopping after epoch 658. Training loss did not improve more than tol=1e-05. Final error is 0.44765447825193405."
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.774     0.747     0.761       997\n",
      "    positive      0.774     0.799     0.786      1079\n",
      "\n",
      "    accuracy                          0.774      2076\n",
      "   macro avg      0.774     0.773     0.773      2076\n",
      "weighted avg      0.774     0.774     0.774      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_torch_softmax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using sklearn Pipelines\n",
    "\n",
    "The `sklearn.pipeline` module defines `Pipeline` objects, which let you chain together different transformations and estimators. `Pipeline` objects are fully compatible with `sst.experiment`. Here's a basic example using `TfidfTransformer` followed by `LogisticRegression`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_pipeline_softmax(X, y):\n",
    "    rescaler = TfidfTransformer()\n",
    "    mod = LogisticRegression(max_iter=2000)\n",
    "    pipeline = Pipeline([\n",
    "        ('scaler', rescaler),\n",
    "        ('model', mod)])\n",
    "    pipeline.fit(X, y)\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.753     0.767     0.760       936\n",
      "    positive      0.806     0.793     0.799      1140\n",
      "\n",
      "    accuracy                          0.781      2076\n",
      "   macro avg      0.779     0.780     0.780      2076\n",
      "weighted avg      0.782     0.781     0.781      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_pipeline_softmax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pipelines can also include the models from the course repo. The one gotcha here is that some `sklearn` transformers return sparse matrices, which are likely to clash with the requirements of these other models. To get around this, just add `utils.DenseTransformer()` where you need to transition from a sparse matrix to a dense one. Here's an example using `TorchShallowNeuralClassifier` with early stopping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_pipeline_classifier(X, y):\n",
    "    rescaler = TfidfTransformer()\n",
    "    mod = TorchShallowNeuralClassifier(early_stopping=True)\n",
    "    pipeline = Pipeline([\n",
    "        ('scaler', rescaler),\n",
    "        # We need this little bridge to go from\n",
    "        # sparse matrices to dense ones:\n",
    "        ('densify', utils.DenseTransformer()),\n",
    "        ('model', mod)])\n",
    "    pipeline.fit(X, y)\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Stopping after epoch 43. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.360526978969574"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.771     0.753     0.762       985\n",
      "    positive      0.782     0.798     0.790      1091\n",
      "\n",
      "    accuracy                          0.777      2076\n",
      "   macro avg      0.777     0.776     0.776      2076\n",
      "weighted avg      0.777     0.777     0.777      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_pipeline_classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Hyperparameter search\n",
    "\n",
    "The training process learns __parameters__ &mdash; the weights. There are typically lots of other parameters that need to be set. For instance, our `BasicSGDClassifier` has a learning rate parameter and a training iteration parameter. These are called __hyperparameters__. The more powerful `sklearn` classifiers and our `torch_*` models have many more such hyperparameters. These are outside of the explicitly stated objective, hence the \"hyper\" part. \n",
    "\n",
    "So far, we have just set the hyperparameters by hand. However, their optimal values can vary widely between datasets, and choices here can dramatically impact performance, so we would like to set them as part of the overall experimental framework."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### utils.fit_classifier_with_hyperparameter_search\n",
    "\n",
    "Luckily, `sklearn` provides a lot of functionality for setting hyperparameters via cross-validation. The function `utils.fit_classifier_with_hyperparameter_search` implements a basic framework for taking advantage of these options. It's really just a lightweight wrapper around `slearn.model_selection.GridSearchCV`.\n",
    "\n",
    "This corresponding model wrappers have the same basic shape as `fit_softmax_classifier` above: they take a dataset as input and return a trained model. However, to find the best model, they explore a space of hyperparameters supplied by the user, seeking the optimal combination of settings. \n",
    "\n",
    "Only the training data is used to perform this search; that data is split into multiple train–test splits, and the best hyperparameter settings are the one that do the best on average across these splits. Once those settings are found, a model is trained with those settings on all the available data and finally evaluated on the assessment data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Example using LogisticRegression\n",
    "\n",
    "Here's a fairly full-featured use of the above for the `LogisticRegression` model family:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_softmax_with_hyperparameter_search(X, y):\n",
    "    \"\"\"\n",
    "    A MaxEnt model of dataset with hyperparameter cross-validation.\n",
    "\n",
    "    Some notes:\n",
    "\n",
    "    * 'fit_intercept': whether to include the class bias feature.\n",
    "    * 'C': weight for the regularization term (smaller is more regularized).\n",
    "    * 'penalty': type of regularization -- roughly, 'l1' ecourages small\n",
    "      sparse models, and 'l2' encourages the weights to conform to a\n",
    "      gaussian prior distribution.\n",
    "    * 'class_weight': 'balanced' adjusts the weights to simulate a\n",
    "      balanced class distribution, whereas None makes no adjustment.\n",
    "\n",
    "    Other arguments can be cross-validated; see\n",
    "    http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    X : 2d np.array\n",
    "        The matrix of features, one example per row.\n",
    "\n",
    "    y : list\n",
    "        The list of labels for rows in `X`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    sklearn.linear_model.LogisticRegression\n",
    "        A trained model instance, the best model found.\n",
    "\n",
    "    \"\"\"\n",
    "    basemod = LogisticRegression(\n",
    "        fit_intercept=True,\n",
    "        solver='liblinear',\n",
    "        multi_class='auto')\n",
    "    cv = 5\n",
    "    param_grid = {\n",
    "        'C': [0.6, 0.8, 1.0, 2.0],\n",
    "        'penalty': ['l1', 'l2'],\n",
    "        'class_weight': ['balanced', None]}\n",
    "    bestmod = utils.fit_classifier_with_hyperparameter_search(\n",
    "        X, y, basemod, cv, param_grid)\n",
    "    return bestmod"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best params: {'C': 2.0, 'class_weight': None, 'penalty': 'l2'}\n",
      "Best score: 0.784\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.781     0.743     0.762       428\n",
      "    positive      0.763     0.800     0.781       444\n",
      "\n",
      "    accuracy                          0.772       872\n",
      "   macro avg      0.772     0.771     0.771       872\n",
      "weighted avg      0.772     0.772     0.772       872\n",
      "\n"
     ]
    }
   ],
   "source": [
    "softmax_experiment = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_softmax_with_hyperparameter_search,\n",
    "    class_func=sst.binary_class_func,\n",
    "    assess_reader=sst.dev_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that the \"Best params\" are found via evaluations only on the training data. The `assess_reader` is held out from that process, so it's giving us an estimate of how we will do on a final test set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reproducing  baselines from Socher et al. 2013\n",
    "\n",
    "The goal of this section is to bring together ideas from the above to reproduce some of the the non-neural baselines from [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf). More specifically, we'll shoot for the root-level binary numbers:\n",
    "\n",
    "| Model              | Accuracy  | \n",
    "|--------------------|-----------|\n",
    "| Unigram NaiveBayes | 81.8      |\n",
    "| Bigram NaiveBayes  | 83.1      |\n",
    "| SVM                | 79.4      |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reproducing the Unigram NaiveBayes results\n",
    "\n",
    "To start, we might just use `MultinomialNB` with default parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_unigram_nb_classifier(X, y):\n",
    "    mod = MultinomialNB()\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.802     0.778     0.790       428\n",
      "    positive      0.792     0.815     0.804       444\n",
      "\n",
      "    accuracy                          0.797       872\n",
      "   macro avg      0.797     0.797     0.797       872\n",
      "weighted avg      0.797     0.797     0.797       872\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_unigram_nb_classifier,\n",
    "    class_func=sst.binary_class_func,\n",
    "    assess_reader=sst.dev_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This falls short of our goal by almost two percentage points, which is not encouraging about how we would do on the test set. However, `MultinomialNB` has a regularization term `alpha` that might have a significant impact given the very large, sparse feature matrices we are creating with `unigrams_phi`. In addition, it might help to transform the raw feature counts, for the same reason that reweighting was so powerful in our VSM module. The best way to try out all these ideas is to do a wide hyperparameter search. The following model wrapper function implements these steps: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_nb_classifier_with_hyperparameter_search(X, y):\n",
    "    rescaler = TfidfTransformer()\n",
    "    mod = MultinomialNB()\n",
    "\n",
    "    pipeline = Pipeline([('scaler', rescaler), ('model', mod)])\n",
    "\n",
    "    # Access the alpha and fit_prior parameters of `mod` with\n",
    "    # `mod__alpha` and `model__fit_prior`, where \"model\" is the\n",
    "    # name from the Pipeline. Use 'passthrough' to optionally\n",
    "    # skip TF-IDF.\n",
    "    param_grid = {\n",
    "        'model__fit_prior': [True, False],\n",
    "        'scaler': ['passthrough', rescaler],\n",
    "        'model__alpha': [0.1, 0.2, 0.4, 0.8, 1.0, 1.2]}\n",
    "\n",
    "    bestmod = utils.fit_classifier_with_hyperparameter_search(\n",
    "        X, y, pipeline, param_grid=param_grid, cv=5)\n",
    "    return bestmod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we run the experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best params: {'model__alpha': 1.2, 'model__fit_prior': False, 'scaler': TfidfTransformer()}\n",
      "Best score: 0.790\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.807     0.775     0.791      1108\n",
      "    positive      0.805     0.833     0.819      1230\n",
      "\n",
      "    accuracy                          0.806      2338\n",
      "   macro avg      0.806     0.804     0.805      2338\n",
      "weighted avg      0.806     0.806     0.806      2338\n",
      "\n"
     ]
    }
   ],
   "source": [
    "unigram_nb_experiment_xval = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_nb_classifier_with_hyperparameter_search,\n",
    "    class_func=sst.binary_class_func,\n",
    "    train_reader=(sst.train_reader, sst.dev_reader))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, we can create a new model with the parameters we found above, and train it on the combination of the train and dev datasets, on the assumption that this extra training data might be useful:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_optimized_unigram_nb_classifier(X, y):\n",
    "    pipeline = unigram_nb_experiment_xval['model']\n",
    "    pipeline.fit(X, y)\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.852     0.798     0.824       912\n",
      "    positive      0.810     0.861     0.835       909\n",
      "\n",
      "    accuracy                          0.830      1821\n",
      "   macro avg      0.831     0.830     0.830      1821\n",
      "weighted avg      0.831     0.830     0.830      1821\n",
      "\n"
     ]
    }
   ],
   "source": [
    "unigram_nb_experiment = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_optimized_unigram_nb_classifier,\n",
    "    class_func=sst.binary_class_func,\n",
    "    train_reader=(sst.train_reader, sst.dev_reader),\n",
    "    assess_reader=sst.test_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're right around the target of 81.8, so we can say that we reproduced the paper's result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reproducing the Bigrams NaiveBayes results\n",
    "\n",
    "For the bigram NaiveBayes mode, we can continue to use `fit_nb_classifier_with_hyperparameter_search`, but now the experiment is done with `bigrams_phi`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best params: {'model__alpha': 0.2, 'model__fit_prior': False, 'scaler': TfidfTransformer()}\n",
      "Best score: 0.736\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.722     0.804     0.761      1118\n",
      "    positive      0.800     0.716     0.756      1220\n",
      "\n",
      "    accuracy                          0.758      2338\n",
      "   macro avg      0.761     0.760     0.758      2338\n",
      "weighted avg      0.763     0.758     0.758      2338\n",
      "\n"
     ]
    }
   ],
   "source": [
    "bigram_nb_experiment_xval = sst.experiment(\n",
    "    SST_HOME,\n",
    "    bigrams_phi,\n",
    "    fit_nb_classifier_with_hyperparameter_search,\n",
    "    class_func=sst.binary_class_func,\n",
    "    train_reader=(sst.train_reader, sst.dev_reader))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our evaluation, we'll train a new model on the union of the train and dev sets, using the best parameters from `bigram_nb_experiment_xval`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_optimized_bigram_nb_classifier(X, y):\n",
    "    pipeline = bigram_nb_experiment_xval['model']\n",
    "    pipeline.fit(X, y)\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.796     0.754     0.775       912\n",
      "    positive      0.766     0.806     0.786       909\n",
      "\n",
      "    accuracy                          0.780      1821\n",
      "   macro avg      0.781     0.780     0.780      1821\n",
      "weighted avg      0.781     0.780     0.780      1821\n",
      "\n"
     ]
    }
   ],
   "source": [
    "bigram_nb_experiment = sst.experiment(\n",
    "    SST_HOME,\n",
    "    bigrams_phi,\n",
    "    fit_optimized_bigram_nb_classifier,\n",
    "    class_func=sst.binary_class_func,\n",
    "    train_reader=(sst.train_reader, sst.dev_reader),\n",
    "    assess_reader=sst.test_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is below the target of 83.1, so we've failed to reproduce the paper's result. I am not sure where the implementation difference between our model and that of Socher et al. lies!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reproducing the SVM results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_svm_classifier_with_hyperparameter_search(X, y):\n",
    "    rescaler = TfidfTransformer()\n",
    "    mod = LinearSVC(loss='squared_hinge', penalty='l2')\n",
    "\n",
    "    pipeline = Pipeline([('scaler', rescaler), ('model', mod)])\n",
    "\n",
    "    # Access the alpha parameter of `mod` with `mod__alpha`,\n",
    "    # where \"model\" is the name from the Pipeline. Use\n",
    "    # 'passthrough' to optionally skip TF-IDF.\n",
    "    param_grid = {\n",
    "        'scaler': ['passthrough', rescaler],\n",
    "        'model__C': [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]}\n",
    "\n",
    "    bestmod = utils.fit_classifier_with_hyperparameter_search(\n",
    "        X, y, pipeline, param_grid=param_grid, cv=5)\n",
    "    return bestmod"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best params: {'model__C': 0.4, 'scaler': TfidfTransformer()}\n",
      "Best score: 0.787\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.801     0.771     0.786       428\n",
      "    positive      0.787     0.815     0.801       444\n",
      "\n",
      "    accuracy                          0.794       872\n",
      "   macro avg      0.794     0.793     0.793       872\n",
      "weighted avg      0.794     0.794     0.793       872\n",
      "\n"
     ]
    }
   ],
   "source": [
    "svm_experiment_xval = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_svm_classifier_with_hyperparameter_search,\n",
    "    class_func=sst.binary_class_func,\n",
    "    assess_reader=sst.dev_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dev set number is not very encouraging, but we've seen that the test set might actually be a bit easier, so we press on with the planned test set evaluation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_optimized_svm_classifier(X, y):\n",
    "    pipeline = svm_experiment_xval['model']\n",
    "    pipeline.fit(X, y)\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "    negative      0.844     0.795     0.819       912\n",
      "    positive      0.806     0.853     0.828       909\n",
      "\n",
      "    accuracy                          0.824      1821\n",
      "   macro avg      0.825     0.824     0.824      1821\n",
      "weighted avg      0.825     0.824     0.824      1821\n",
      "\n"
     ]
    }
   ],
   "source": [
    "svm_experiment = sst.experiment(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_optimized_svm_classifier,\n",
    "    class_func=sst.binary_class_func,\n",
    "    train_reader=(sst.train_reader, sst.dev_reader),\n",
    "    assess_reader=sst.test_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Right on! This is quite a ways above the target of 79.4, so we can say that we successfully reproduced this result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Statistical comparison of classifier models\n",
    "\n",
    "Suppose two classifiers differ according to an effectiveness measure like F1 or accuracy. Are they meaningfully different?\n",
    "\n",
    "* For very large datasets, the answer might be clear: if performance is very stable across different train/assess splits and the difference in terms of correct predictions has practical importance, then you can clearly say yes. \n",
    "\n",
    "* With smaller datasets, or models whose performance is closer together, it can be harder to determine whether the two models are different. We can address this question in a basic way with repeated runs and basic null-hypothesis testing on the resulting score vectors.\n",
    "\n",
    "In general, one wants to compare __two feature functions against the same model__, or one wants to compare __two models with the same feature function used for both__. If both are changed at the same time, then it will be hard to figure out what is causing any differences you see."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Comparison with the Wilcoxon signed-rank test\n",
    "\n",
    "The function `sst.compare_models` is designed for such testing. The default set-up uses the non-parametric [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) to make the comparisons, which is relatively conservative and recommended by [Demšar 2006](http://www.jmlr.org/papers/v7/demsar06a.html) for cases where one can afford to do multiple assessments. For discussion, see [the evaluation methods notebook](evaluation_methods.ipynb#Wilcoxon-signed-rank-test).\n",
    "\n",
    "Here's an example showing the default parameters values and comparing `LogisticRegression` and `BasicSGDClassifier`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model 1 mean: 0.780\n",
      "Model 2 mean: 0.749\n",
      "p = 0.005\n"
     ]
    }
   ],
   "source": [
    "_ = sst.compare_models(\n",
    "    SST_HOME,\n",
    "    unigrams_phi,\n",
    "    fit_softmax_classifier,\n",
    "    stats_test=scipy.stats.wilcoxon,\n",
    "    trials=10,\n",
    "    phi2=None,  # Defaults to same as first required argument.\n",
    "    train_func2=fit_basic_sgd_classifier, # Defaults to same as second required argument.\n",
    "    reader=sst.train_reader,\n",
    "    train_size=0.7,\n",
    "    class_func=sst.binary_class_func,\n",
    "    score_func=utils.safe_macro_f1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Comparison with McNemar's test\n",
    "\n",
    "[McNemar's test](https://en.wikipedia.org/wiki/McNemar%27s_test) operates directly on the vectors of predictions for the two models being compared. As such, it doesn't require repeated runs, which is good where optimization is expensive. For discussion, see [the evaluation methods notebook](evaluation_methods.ipynb#McNemar's-test)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = utils.mcnemar(\n",
    "    unigram_nb_experiment['assess_dataset']['y'],\n",
    "    unigram_nb_experiment['predictions'],\n",
    "    bigram_nb_experiment['predictions'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "McNemar's test: 22.38 (p < 0.0001)\n"
     ]
    }
   ],
   "source": [
    "p = \"p < 0.0001\" if m[1] < 0.0001 else m[1]\n",
    "\n",
    "print(\"McNemar's test: {0:0.02f} ({1:})\".format(m[0], p))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
